difference estimator


PPI is the Difference Estimator: Recognizing the Survey Sampling Roots of Prediction-Powered Inference

arXiv.org Machine Learning

Prediction-powered inference (PPI) is a rapidly growing framework for combining machine learning predictions with a small set of gold-standard labels to conduct valid statistical inference. In this article, I argue that the core estimators underlying PPI are equivalent to well-established estimators from the survey sampling literature dating back to the 1970s. Specifically, the PPI estimator for a population mean is algebraically equivalent to the difference estimator of Cassel et al. (1976), and PPI++ corresponds to the generalized regression (GREG) estimator of Särndal et al. (2003). Recognizing this equivalence, I consider what part of PPI is inherited from a long-standing literature in statistics, what part is genuinely new, and where inferential claims require care. After introducing the two frameworks and establishing their equivalence, I break down where PPI diverges from model-assisted estimation, including differences in the mode of inference, the role of the unlabeled data pool, and the consequences of differential prediction error for subgroup estimands such as the average treatment effect. I then identify what each framework offers the other: PPI researchers can draw on the survey sampling literature's well-developed theory of calibration, optimal allocation, and design-based diagnostics, while survey sampling researchers can benefit from PPI's extensions to non-standard estimands and its accessible software ecosystem. The article closes with a call for integration between these two communities, motivated by the growing use of large language models as measurement instruments in applied research.
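The equivalence for a population mean can be illustrated with a minimal sketch. It assumes simple random sampling with equal weights, and the function names `ppi_mean` and `difference_estimator_mean` are hypothetical labels chosen here, not names from the article or from any PPI software package. The first function writes the estimator the way it is usually presented in the PPI literature (prediction mean minus a "rectifier"), the second the way it is usually presented in survey sampling (proxy mean plus mean residual).

```python
import random
from statistics import fmean


def ppi_mean(preds_unlabeled, preds_labeled, y_labeled):
    """PPI-style mean: average prediction on the unlabeled pool,
    minus the rectifier (average prediction error on the labeled set)."""
    rectifier = fmean(p - y for p, y in zip(preds_labeled, y_labeled))
    return fmean(preds_unlabeled) - rectifier


def difference_estimator_mean(proxy_pool, proxy_sample, y_sample):
    """Difference-estimator-style mean under equal-probability sampling:
    average proxy value over the pool, plus the average residual in the sample."""
    mean_residual = fmean(y - p for p, y in zip(proxy_sample, y_sample))
    return fmean(proxy_pool) + mean_residual


# Synthetic data (assumption: predictions are the truth plus bias and noise).
rng = random.Random(0)
truth_pool = [rng.gauss(10.0, 2.0) for _ in range(5000)]
preds_pool = [t + 0.5 + rng.gauss(0.0, 1.0) for t in truth_pool]

y_lab = [rng.gauss(10.0, 2.0) for _ in range(200)]
preds_lab = [t + 0.5 + rng.gauss(0.0, 1.0) for t in y_lab]

ppi = ppi_mean(preds_pool, preds_lab, y_lab)
diff = difference_estimator_mean(preds_pool, preds_lab, y_lab)
print(ppi, diff)  # the two forms give the same number
```

The two functions differ only in the sign convention of the correction term, which is the sense in which the equivalence is algebraic. Unequal inclusion probabilities, variance estimation, and the GREG-style tuning parameter are outside this sketch.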


Practical Improvements of A/B Testing with Off-Policy Estimation

arXiv.org Machine Learning

We address the problem of A/B testing, a widely used protocol for evaluating the potential improvement achieved by a new decision system compared to a baseline. This protocol segments the population into two subgroups, each exposed to one version of the system, and estimates the improvement as the difference between the measured effects. In this work, we demonstrate that the commonly used difference-in-means estimator, while unbiased, can be improved. We introduce a family of unbiased off-policy estimators that achieves lower variance than the standard approach. Among this family, we identify the estimator with the lowest variance. The resulting estimator is simple and offers substantial variance reduction when the two tested systems exhibit similarities. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.


A Framework for Adversarial Streaming via Differential Privacy and Difference Estimators

arXiv.org Artificial Intelligence

Streaming algorithms process large data streams while using only a limited amount of memory, significantly less than what is needed to store the entire data stream. Data streams occur in many applications including computer networking, databases, and natural language processing. The seminal work of Alon, Matias, and Szegedy [AMS99] initiated an extensive theoretical study and further applications of streaming algorithms. In this work we focus on streaming algorithms that aim to maintain, at any point in time, an approximation for the value of some (predefined) real-valued function of the input stream. Such streaming algorithms are sometimes referred to as strong trackers. For example, this predefined function might count the number of distinct elements in the stream.


Scalable MCMC for Large Data Problems using Data Subsampling and the Difference Estimator

arXiv.org Machine Learning

We propose a generic Markov chain Monte Carlo (MCMC) algorithm to speed up computations for datasets with many observations. A key feature of our approach is the use of the highly efficient difference estimator from the survey sampling literature to estimate the log-likelihood accurately using only a small fraction of the data. Our algorithm improves on the $O(n)$ complexity of regular MCMC by operating over local data clusters instead of the full sample when computing the likelihood. The likelihood estimate is used in a pseudo-marginal framework to sample from a perturbed posterior which is within $O(m^{-1/2})$ of the true posterior, where $m$ is the subsample size. The method is applied to a logistic regression model to predict firm bankruptcy for a large dataset. We document a significant speedup in comparison to standard MCMC on the full dataset.
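As a rough illustration of how a difference estimator can estimate a log-likelihood from a subsample, the sketch below sums a cheap proxy over clusters and corrects it with exact-minus-proxy residuals on a random subsample of size m. Everything specific here is an assumption made for illustration: the N(theta, 1) model, the grid-based `centroid` rule, sampling with replacement, and the function names. The abstract does not specify the paper's actual proxy construction or clustering.

```python
import math
import random
from collections import Counter


def loglik_term(y, theta):
    """Log-density of a single observation under N(theta, 1) (toy model)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2


def centroid(y):
    """Hypothetical clustering rule: snap an observation to the nearest 0.5."""
    return round(y * 2) / 2


def difference_estimate_loglik(data, theta, m, rng):
    """Estimate sum_i loglik_term(data[i], theta) from m subsampled terms."""
    n = len(data)
    # Proxy total: one evaluation per cluster, weighted by cluster size.
    # The cluster counts do not depend on theta and could be precomputed once.
    counts = Counter(centroid(y) for y in data)
    proxy_total = sum(c * loglik_term(z, theta) for z, c in counts.items())
    # Correction: scaled sum of (exact - proxy) over a simple random subsample
    # drawn with replacement, which keeps the estimator unbiased.
    idx = [rng.randrange(n) for _ in range(m)]
    correction = (n / m) * sum(
        loglik_term(data[i], theta) - loglik_term(centroid(data[i]), theta)
        for i in idx
    )
    return proxy_total + correction


rng = random.Random(1)
data = [rng.gauss(1.0, 1.0) for _ in range(2000)]
theta = 0.8

exact = sum(loglik_term(y, theta) for y in data)
estimate = difference_estimate_loglik(data, theta, m=100, rng=rng)
print(exact, estimate)
```

The closer the proxy tracks the exact terms, the smaller the residuals and the lower the variance of the estimate, which is the reason a small subsample can suffice. Using such an estimate inside a pseudo-marginal sampler involves further steps that are not shown here.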